CRUXEval-output: by examples


Not solved by any model

28 examples were not solved by any model. If these are well-posed problems, solving some of them is a good signal that your model is better than the leading models.
CRUXEval-output/112, CRUXEval-output/113, CRUXEval-output/129, CRUXEval-output/149, CRUXEval-output/163, CRUXEval-output/175, CRUXEval-output/177, CRUXEval-output/218, CRUXEval-output/229, CRUXEval-output/245, CRUXEval-output/250, CRUXEval-output/254, CRUXEval-output/259, CRUXEval-output/272, CRUXEval-output/280, CRUXEval-output/301, CRUXEval-output/307, CRUXEval-output/33, CRUXEval-output/340, CRUXEval-output/375, CRUXEval-output/445, CRUXEval-output/44, CRUXEval-output/469, CRUXEval-output/488, CRUXEval-output/581, CRUXEval-output/622, CRUXEval-output/640, CRUXEval-output/671
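
A list like the one above can be derived from per-model pass/fail results. The following is a minimal sketch, not the page's actual code: the `results` layout, the model names, and the problem ids are hypothetical placeholders.

```python
# Hypothetical data: results[model][problem_id] is True if the model solved it.
results = {
    "model_a": {"p1": True, "p2": False, "p3": False},
    "model_b": {"p1": True, "p2": True, "p3": False},
}


def unsolved_problems(results):
    """Return the sorted ids of problems that no model solved."""
    problems = set()
    for per_problem in results.values():
        problems.update(per_problem)
    return sorted(
        p
        for p in problems
        # A problem missing from a model's results counts as not solved.
        if not any(per_problem.get(p, False) for per_problem in results.values())
    )


unsolved = unsolved_problems(results)
print(unsolved)
```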

Problems solved by 1 model only

example_link model min_elo
CRUXEval-output/444 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/599 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/169 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/484 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/698 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/125 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/591 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/126 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/458 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/35 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/220 gpt-4-turbo-2024-04-09+cot 1508.026
CRUXEval-output/501 claude-3-opus-20240229+cot 1489.546
CRUXEval-output/317 claude-3-opus-20240229+cot 1489.546
CRUXEval-output/726 claude-3-opus-20240229+cot 1489.546
CRUXEval-output/391 claude-3-opus-20240229+cot 1489.546
CRUXEval-output/158 claude-3-opus-20240229+cot 1489.546
CRUXEval-output/631 gpt-4-0613+cot 1392.187
CRUXEval-output/5 gpt-4-0613+cot 1392.187
CRUXEval-output/310 gpt-4-0613+cot 1392.187
CRUXEval-output/799 gpt-4-0613+cot 1392.187
CRUXEval-output/438 gpt-4-0613 1283.246
CRUXEval-output/556 gpt-4-turbo-2024-04-09 1267.174
CRUXEval-output/211 gpt-4-turbo-2024-04-09 1267.174
CRUXEval-output/550 gpt-3.5-turbo-0613+cot 1116.281
CRUXEval-output/568 gpt-3.5-turbo-0613+cot 1116.281
CRUXEval-output/749 codellama-34b+cot 884.835
CRUXEval-output/499 mixtral-8x7b 855.530
CRUXEval-output/347 codellama-7b+cot 644.964
CRUXEval-output/571 codellama-7b+cot 644.964
CRUXEval-output/209 phi-1 610.177
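
The table above can be read as: for each problem with exactly one solver, report that solver and its Elo, which is then also the minimum Elo needed to solve the problem. Below is a rough sketch of that selection, assuming a hypothetical `results` mapping of pass/fail outcomes and a hypothetical `elo` mapping of model ratings; neither reflects the page's real data format.

```python
# Hypothetical inputs: pass/fail per model and problem, plus one rating per model.
results = {
    "strong": {"p1": True, "p2": True, "p3": False},
    "medium": {"p1": True, "p2": False, "p3": False},
    "weak": {"p1": False, "p2": False, "p3": True},
}
elo = {"strong": 1500.0, "medium": 1200.0, "weak": 600.0}


def solved_by_one(results, elo):
    """Return (problem, model, min_elo) rows for problems with exactly one solver."""
    problems = {p for per_problem in results.values() for p in per_problem}
    rows = []
    for p in problems:
        solvers = [m for m, per_problem in results.items() if per_problem.get(p, False)]
        if len(solvers) == 1:
            # With a single solver, the minimum Elo over solvers is that model's Elo.
            rows.append((p, solvers[0], elo[solvers[0]]))
    # Sort by rating, highest first, as the table above appears to be ordered.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


rows = solved_by_one(results, elo)
for problem, model, min_elo in rows:
    print(problem, model, min_elo)
```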

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation. A negative tau means that better models tend to do worse on the problem.

example_link acc tau
CRUXEval-output/329 0.686 -0.358
CRUXEval-output/563 0.800 -0.322
CRUXEval-output/333 0.400 -0.301
CRUXEval-output/297 0.371 -0.286
CRUXEval-output/691 0.314 -0.262
CRUXEval-output/118 0.514 -0.258
CRUXEval-output/132 0.457 -0.245
CRUXEval-output/209 0.029 -0.239
CRUXEval-output/57 0.629 -0.238
CRUXEval-output/638 0.114 -0.236
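
The page does not say how tau is computed. One plausible reading, shown here only as an assumption, is Kendall's tau-b between each model's overall score and its 0/1 result on a single problem. The scores and outcomes in this sketch are made up for illustration.

```python
import math
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Kendall's tau-b for two equal-length sequences, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # tau-b is undefined when one of the sequences is constant.
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical example: the two weakest models solve the problem, the two strongest fail.
overall_score = [0.2, 0.4, 0.6, 0.8]
solved = [1, 1, 0, 0]
tau = kendall_tau_b(overall_score, solved)
print(round(tau, 3))
```

Under this reading, a problem that only weaker models solve gets a negative tau, which matches the kind of problem the table flags as suspect.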

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
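
As a rough illustration of the binning behind such a histogram, the sketch below counts per-problem accuracies in equal-width bins. The bin count of 10 is an assumption, and the sample input is only the ten accuracies listed in the suspect-problems table, not the full dataset.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count accuracies in [0, 1] using n_bins equal-width bins."""
    counts = [0] * n_bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 goes into the last bin.
        index = min(int(acc * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Accuracies of the ten problems in the suspect-problems table.
sample = [0.686, 0.800, 0.400, 0.371, 0.314, 0.514, 0.457, 0.029, 0.629, 0.114]
counts = accuracy_histogram(sample)
for i, count in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {'#' * count}")
```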

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.